Data transfer
Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management

Yang, Xinjun, Hu, Qingda, Li, Junru, Li, Feifei, Zhu, Yicong, Zhou, Yuqi, Lin, Qiuru, Dai, Jian, Kong, Yang, Zhang, Jiayu, Xu, Guoqiang, Liu, Qiang

arXiv.org Artificial Intelligence

The rapid increase in LLM model sizes and the growing demand for long-context inference have made memory a critical bottleneck in GPU-accelerated serving systems. Although high-bandwidth memory (HBM) on GPUs offers fast access, its limited capacity necessitates reliance on host memory (CPU DRAM) to support larger working sets such as the KVCache. However, the maximum DRAM capacity is constrained by the limited number of memory channels per CPU socket. To overcome this limitation, current systems often adopt RDMA-based disaggregated memory pools, which introduce significant challenges including high access latency, complex communication protocols, and synchronization overhead. Fortunately, the emerging CXL technology introduces new opportunities in KVCache design. In this paper, we propose Beluga, a novel memory architecture that enables GPUs and CPUs to access a shared, large-scale memory pool through CXL switches. By supporting native load/store access semantics over the CXL fabric, our design delivers near-local memory latency, while reducing programming complexity and minimizing synchronization overhead. We conduct a systematic characterization of a commercial CXL switch-based memory pool and propose a set of design guidelines. Based on Beluga, we design and implement Beluga-KVCache, a system tailored for managing the large-scale KVCache in LLM inference. Beluga-KVCache achieves an 89.6% reduction in Time-To-First-Token (TTFT) and a 7.35x throughput improvement in the vLLM inference engine compared to RDMA-based solutions. To the best of our knowledge, Beluga is the first system that enables GPUs to directly access large-scale memory pools through CXL switches, marking a significant step toward low-latency, shared access to vast memory resources by GPUs.
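The programming-model contrast the abstract draws — native load/store access versus RDMA verbs, queue pairs, and completion polling — can be illustrated with a small analogy. The sketch below uses an anonymous `mmap` as a stand-in for a switch-attached CXL memory pool (an assumption for illustration only, not Beluga's actual interface): once the pool is mapped into the address space, producers and consumers use plain reads and writes with no explicit communication protocol.

```python
import mmap

def shared_pool_demo(size=4096):
    """Analogy for CXL load/store semantics: once a memory pool is mapped
    into the address space, access is a plain load or store, with no
    RDMA-style verbs or completion polling. An anonymous mmap stands in
    for the switch-attached CXL memory pool here."""
    pool = mmap.mmap(-1, size)   # stand-in for the mapped CXL region
    pool[0:5] = b"hello"         # a producer 'stores' bytes directly
    return bytes(pool[0:5])      # a consumer 'loads' them back
```

An RDMA-based pool would instead require registering memory regions, posting send/receive work requests, and synchronizing on completions — the complexity the shared-mapping model removes.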


CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design

Yi, Jiawei, Gong, Ping, Bai, Youhui, Ruan, Jiaqi, Wang, Shengnan, Wang, Pengcheng, Wang, Haibo, Wang, Weiguang, Zhu, Xia, Wu, Feng, Li, Cheng

arXiv.org Artificial Intelligence

The growth of million-token LLMs exposes the scalability limits of inference systems, where the KVCache dominates memory usage and data transfer overhead. Recent offloading systems migrate the KVCache to CPU memory and incorporate top-k attention to reduce the volume of data transferred from the CPU, while further applying system-level optimizations such as on-GPU caching and prefetching to lower transfer overhead. However, they overlook the CPU bottleneck in three aspects: (1) substantial overhead of fine-grained dynamic cache management performed on the CPU side, (2) significant transfer overhead from poor PCIe bandwidth utilization caused by heavy gathering operations at the CPU side, and (3) GPU runtime bubbles introduced by coarse-grained CPU-centric synchronization. To address these challenges, we propose CLO, a CPU-light KVCache offloading system via algorithm-system co-design. CLO features: (1) a coarse-grained head-wise approximate on-GPU caching strategy with negligible cache management cost, (2) a seamless combination of data prefetching and on-GPU persistent caching for lower transfer overhead, (3) a zero-copy transfer engine to fully exploit PCIe bandwidth, and (4) a GPU-centric synchronization method to eliminate GPU stalls. Evaluation on two widely-used LLMs demonstrates that CLO achieves comparable accuracy to state-of-the-art systems, while substantially reducing CPU overhead and fully utilizing PCIe bandwidth, thus improving decoding throughput by 9.3%-66.6%. Our results highlight that algorithm-system co-design is essential for memory-constrained LLM inference on modern GPU platforms. We open source CLO at https://github.com/CommediaJW/CLO.
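The top-k attention idea underlying such offloading systems can be sketched in a few lines: score all cached keys cheaply, fetch only the top-k KV pairs (the small subset that would cross PCIe), then run exact attention on that subset. The function name and shapes below are illustrative assumptions, not CLO's actual API.

```python
import numpy as np

def topk_kv_fetch(q, k_cache, v_cache, top_k):
    """Sketch of top-k attention offloading: score every cached key with a
    cheap dot product, select the top_k most relevant KV pairs (simulating
    a CPU->GPU transfer of only that subset), then run exact softmax
    attention over the fetched entries. Illustrative only."""
    scores = k_cache @ q                             # (n_tokens,)
    idx = np.argpartition(scores, -top_k)[-top_k:]   # indices of top-k keys
    k_sel, v_sel = k_cache[idx], v_cache[idx]        # "transferred" subset
    logits = k_sel @ q
    w = np.exp(logits - np.max(logits))              # stable softmax
    w /= w.sum()
    return w @ v_sel                                 # (d_head,)
```

The systems contribution is everything around this kernel: who manages the cache of hot entries, how the gather is performed without CPU copies, and how the GPU learns the transfer is done without stalling.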


TASP: Topology-aware Sequence Parallelism

Wang, Yida, Hong, Ke, Li, Xiuhong, Xu, Yuanchao, Wang, Wenxun, Dai, Guohao, Wang, Yu

arXiv.org Artificial Intelligence

Long-context large language models (LLMs) face constraints due to the quadratic complexity of the self-attention mechanism. The mainstream sequence parallelism (SP) method, Ring Attention, attempts to solve this by distributing the query into multiple query chunks across accelerators and enabling each Q tensor to access all KV tensors from other accelerators via the Ring AllGather communication primitive. However, it exhibits low communication efficiency, restricting its practical applicability. This inefficiency stems from the mismatch between the Ring AllGather communication primitive it adopts and the AlltoAll topology of modern accelerators. A Ring AllGather primitive is composed of iterations of ring-styled data transfers, which can only utilize a very limited fraction of an AlltoAll topology. Inspired by the Hamiltonian decomposition of complete directed graphs, we identify that modern accelerator topology can be decomposed into multiple orthogonal ring datapaths which can concurrently transfer data without interference. Based on this, we further observe that the Ring AllGather primitive can also be decomposed into the same number of concurrent ring-styled data transfers at every iteration. Based on these insights, we propose TASP, a topology-aware SP method for long-context LLMs that fully utilizes the communication capacity of modern accelerators via topology decomposition and primitive decomposition. Experimental results on both single-node and multi-node NVIDIA H100 systems and a single-node AMD MI300X system demonstrate that TASP achieves higher communication efficiency than Ring Attention on these modern accelerator topologies and achieves up to a 3.58x speedup over Ring Attention and its variant Zigzag-Ring Attention. The code is available at https://github.com/infinigence/HamiltonAttention.
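The decomposition insight can be made concrete with a classic "stride ring" construction: the complete directed graph on n nodes splits into n-1 edge-disjoint rings, where ring k sends node i to (i+k) mod n. When gcd(k, n) = 1 that ring is a single Hamiltonian cycle, so for prime n all n-1 rings are Hamiltonian and can carry data concurrently without sharing a link. This is an illustrative construction under those assumptions, not necessarily the paper's exact decomposition.

```python
def ring_decomposition(n):
    """Decompose the complete directed graph K_n into n-1 edge-disjoint
    'stride' rings: ring k consists of the directed edges i -> (i+k) mod n.
    Every directed edge (i, j), i != j, belongs to exactly one stride class
    k = (j - i) mod n, so the rings partition all n*(n-1) edges."""
    return [[(i, (i + k) % n) for i in range(n)] for k in range(1, n)]
```

Running all rings in parallel at each AllGather iteration — instead of a single ring — is what lets an all-to-all topology's full link capacity be used.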


FastTrack: GPU-Accelerated Tracking for Visual SLAM

Khabiri, Kimia, Hosseininejad, Parsa, Gopinath, Shishir, Dantu, Karthik, Ko, Steven Y.

arXiv.org Artificial Intelligence

The tracking module of a visual-inertial SLAM system processes incoming image frames and IMU data to estimate the position of each frame in relation to the map. It is important for tracking to complete in a timely manner for each frame to avoid poor localization or tracking loss. We therefore present a new approach that leverages GPU computing power to accelerate the time-consuming components of tracking, including stereo feature matching and local map tracking, in order to improve its performance. We implement our design inside the ORB-SLAM3 tracking process using CUDA. Our evaluation demonstrates an overall improvement in tracking performance of up to 2.8x on a desktop and a Jetson Xavier NX board in stereo-inertial mode, using the well-known SLAM datasets EuRoC and TUM-VI.
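Stereo feature matching is a natural GPU target because it is a dense, data-parallel kernel: every left-image descriptor is compared against many right-image descriptors by Hamming distance. The vectorized sketch below shows the shape of that computation for ORB-style bit-packed binary descriptors; it is an illustrative stand-in, not the ORB-SLAM3/CUDA implementation.

```python
import numpy as np

def match_descriptors(desc_left, desc_right):
    """Brute-force binary descriptor matching, the data-parallel kernel a
    FastTrack-style system moves to the GPU. Each descriptor is a
    bit-packed uint8 row (e.g. 32 bytes for ORB). For every left
    descriptor, return the index of the right descriptor with the minimal
    Hamming distance."""
    # XOR every pair, then popcount: distance matrix (n_left, n_right).
    xor = desc_left[:, None, :] ^ desc_right[None, :, :]
    dists = np.unpackbits(xor, axis=2).sum(axis=2)
    return dists.argmin(axis=1)
```

On a GPU the same all-pairs XOR/popcount maps onto one thread per candidate pair, which is where the speedup over a sequential CPU loop comes from.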


An Open-Source HW-SW Co-Development Framework Enabling Efficient Multi-Accelerator Systems

Antonio, Ryan Albert, Dumoulin, Joren, Yi, Xiaoling, Van Delm, Josse, Deng, Yunhao, Paim, Guilherme, Verhelst, Marian

arXiv.org Artificial Intelligence

Heterogeneous accelerator-centric compute clusters are emerging as efficient solutions for diverse AI workloads. However, current integration strategies often compromise data movement efficiency and encounter compatibility issues in hardware and software. This prevents a unified approach that balances performance and ease of use. To this end, we present SNAX, an open-source integrated HW-SW framework enabling efficient multi-accelerator platforms through a novel hybrid-coupling scheme, consisting of loosely coupled asynchronous control and tightly coupled data access. SNAX brings reusable hardware modules designed to enhance compute accelerator utilization, and its customizable MLIR-based compiler to automate key system management tasks, jointly enabling rapid development and deployment of customized multi-accelerator compute clusters. Through extensive experimentation, we demonstrate SNAX's efficiency and flexibility in a low-power heterogeneous SoC. Accelerators can easily be integrated and programmed to achieve > 10x improvement in neural network performance compared to other accelerator systems while maintaining accelerator utilization of > 90% in full system operation.


PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices

Liu, Yangyijian, Li, Jun, Li, Wu-Jun

arXiv.org Artificial Intelligence

The high memory and computation demand of large language models (LLMs) makes them challenging to deploy on consumer devices due to limited GPU memory. Offloading can mitigate the memory constraint but often suffers from low GPU utilization, leading to low inference efficiency. In this work, we propose a novel framework, called pipelined offloading (PIPO), for efficient inference on consumer devices. PIPO designs a fine-grained offloading pipeline, complemented with optimized data transfer and computation, to achieve high concurrency and efficient scheduling for inference. Experimental results show that, compared with a state-of-the-art baseline, PIPO increases GPU utilization from below 40% to over 90% and achieves up to 3.1$\times$ higher throughput, running on a laptop equipped with an RTX 3060 GPU with 6GB of memory.
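The core of any offloading pipeline is overlapping weight transfer with computation: while layer i runs, the weights for layer i+1 are already in flight. The sketch below shows that double-buffering pattern with a background thread; the callables `load` and `compute` are placeholders for a host-to-GPU transfer and a layer forward pass, and the whole function is an illustrative sketch rather than PIPO's actual scheduler.

```python
import threading

def pipelined_inference(layers, x, load, compute):
    """Minimal sketch of pipelined offloading: while layer i computes, a
    background thread prefetches layer i+1's weights, overlapping
    PCIe-style transfer with GPU-style compute."""
    weights = load(layers[0])                 # bring in the first layer
    for i in range(len(layers)):
        prefetched, t = {}, None
        if i + 1 < len(layers):               # start the next transfer early
            t = threading.Thread(
                target=lambda nxt=layers[i + 1]: prefetched.update(w=load(nxt)))
            t.start()
        x = compute(weights, x)               # compute overlaps the transfer
        if t:
            t.join()                          # wait only if transfer is slower
            weights = prefetched["w"]
    return x
```

When per-layer compute time exceeds transfer time, the `join` returns immediately and the transfers are fully hidden; PIPO's contribution is making the pipeline fine-grained enough that this holds on consumer hardware.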


Transfer data from your Android phone to your Windows PC: The ultimate guide

PCWorld

Nowadays, a smartphone replaces the (video) camera on holiday, acts as a portable music player, stores all your WhatsApp media, and holds audio plays, e-books, and documents. To avoid losing such data, you should create regular backups, and your home Windows PC is ideal for this. The home computer is also a good data source, as it often houses downloads, music libraries, and video archives. However, if you want to transfer music, videos, or images between your smartphone and a Windows PC, you are spoiled for choice: there is a whole range of different methods available for this data transfer. The simplest and quickest method of connecting an Android device to your Windows PC is the classic USB cable.


Confidential Computing on nVIDIA H100 GPU: A Performance Benchmark Study

Zhu, Jianwei, Yin, Hang, Deng, Peng, Zhou, Shunfan

arXiv.org Artificial Intelligence

This report evaluates the performance impact of enabling Trusted Execution Environments (TEE) on NVIDIA H100 GPUs for large language model (LLM) inference tasks. We benchmark the overhead introduced by TEE mode across various LLMs and token lengths, with a particular focus on the bottleneck caused by CPU-GPU data transfers via PCIe. Our results indicate that while there is minimal computational overhead within the GPU, the overall performance penalty is primarily attributable to data transfer. For the majority of typical LLM queries, the overhead remains below 5%, with larger models and longer sequences experiencing nearly zero overhead.


Employing Artificial Intelligence to Steer Exascale Workflows with Colmena

Ward, Logan, Pauloski, J. Gregory, Hayot-Sasson, Valerie, Babuji, Yadu, Brace, Alexander, Chard, Ryan, Chard, Kyle, Thakur, Rajeev, Foster, Ian

arXiv.org Artificial Intelligence

Computational workflows are a common class of application on supercomputers, yet their loosely coupled and heterogeneous nature often prevents them from taking full advantage of a supercomputer's capabilities. We created Colmena to leverage the massive parallelism of a supercomputer by using Artificial Intelligence (AI) to learn from and adapt a workflow as it executes. Colmena allows scientists to define how their application should respond to events (e.g., task completion) as a series of cooperative agents. In this paper, we describe the design of Colmena, the challenges we overcame while deploying applications on exascale systems, and the science workflows we have enhanced through interweaving AI. The scaling challenges we discuss include developing steering strategies that maximize node utilization, introducing data fabrics that reduce communication overhead of data-intensive tasks, and implementing workflow tasks that cache costly operations between invocations. These innovations, coupled with a variety of application patterns accessible through our agent-based steering model, have enabled science advances in chemistry, biophysics, and materials science using different types of AI. Our vision is that Colmena will spur creative solutions that harness AI across many domains of scientific computing.


Unlocking the Potential of Binding Corporate Rules (BCRs) in Health Data Transfers

Compagnucci, Marcelo Corrales, Fenwick, Mark, Haapio, Helena

arXiv.org Artificial Intelligence

This chapter explores the essential role of Binding Corporate Rules (BCRs) in managing and facilitating secure health data transfers within corporate groups under the EU General Data Protection Regulation (GDPR). BCRs are tailored to ensure compliance with the GDPR and similar international data protection laws, presenting a flexible mechanism for transferring sensitive health and genomic data. The chapter situates BCRs within the broader spectrum of the GDPR international data transfer mechanisms, addressing the unique challenges posed by the sensitive nature of health data and the increased adoption of AI technologies. The European Data Protection Board (EDPB) Recommendations 1/2022 on BCRs, issued following the Schrems II decision, are critically analyzed, highlighting their stringent requirements and the need for a balanced approach that prioritizes data protection and an AI governance framework. The chapter outlines the BCR approval process, stressing the importance of streamlining this process to encourage broader adoption. It underscores the necessity of a multidisciplinary approach in developing BCRs, incorporating recently adopted international standards and frameworks, which offer valuable guidance for organizations to build trustworthy AI management systems. They guarantee the ethical development, deployment, and operation of AI, which is essential for its successful integration and the broader digital transformation. In conclusion, BCRs are positioned as essential tools for secure health data management, fostering transparency, accountability, and collaboration across international borders. The chapter calls for proactive measures to incentivize BCR adoption, streamline approval processes, and promote more innovative approaches, ensuring BCRs remain a robust mechanism for global data protection and compliance.